CI: Use per-machine runner labels for GPU load test #299

gyohuangxin wants to merge 7 commits into main
Conversation
Pull request overview
Updates the manual GPU load test workflow to target specific self-hosted runner labels per physical machine, avoiding duplicate runs caused by a shared runner label.
Changes:
- Replaced the single shared runner label in the default (“all”) runner matrix with a fixed list of per-machine runner labels.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
    if [ -n "$containers" ]; then
      docker kill $containers || true
    fi
    docker rm -f ${{ env.CONTAINER_NAME }} 2>/dev/null || true
    docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
The cleanup step no longer removes an existing ${{ env.CONTAINER_NAME }} container. Since docker ps -q only lists running containers, a previously-stopped gpu_load_test container could remain and make the later docker run -dt --name ${{ env.CONTAINER_NAME }} fail with a name conflict. Consider explicitly docker rm -f ${{ env.CONTAINER_NAME }} (or removing all containers you kill) as part of this step.
```diff
  INPUT="${{ inputs.runners || 'all' }}"
  if [ "$INPUT" = "all" ]; then
-   MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+   MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
This workflow now uses per-machine runner labels for the default all case, but .github/runner-config.yml still only documents the old shared atom-mi355-8gpu.predownload label and indicates it should be updated when CI runner labels change. To keep the runner→hardware mapping (and any dependent dashboards) accurate, add entries for the new mia1-p01-g* labels there (or otherwise update the mapping source of truth).
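To illustrate the kind of update the comment asks for, entries along these lines could be added to .github/runner-config.yml. The field names here are guesses, since the file's actual schema is not shown in this diff; mirror the shape of the existing atom-mi355-8gpu.predownload entry.

```yaml
# Hypothetical additions to .github/runner-config.yml -- field names are
# illustrative; copy the structure of the existing entries in that file.
runners:
  - label: mia1-p01-g33
    gpu: MI355-8GPU
  - label: mia1-p01-g34
    gpu: MI355-8GPU
  - label: mia1-p01-g40
    gpu: MI355-8GPU
  - label: mia1-p01-g42
    gpu: MI355-8GPU
  - label: mia1-p01-g45
    gpu: MI355-8GPU
  - label: mia1-p01-g64
    gpu: MI355-8GPU
```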
```diff
  MODEL_NAME="deepseek-ai/DeepSeek-R1-0528"
- MODEL_LOCAL_PATH="/data/deepseek-ai/DeepSeek-R1-0528"
+ MODEL_LOCAL_PATH="/models/deepseek-ai/DeepSeek-R1-0528"
  TENSOR_PARALLEL=8
  KV_CACHE_DTYPE="fp8"
  TEMPERATURE=0
```
MODEL_LOCAL_PATH was changed to /models/..., but the script’s docker run later only bind-mounts /data into the container. If the model is found locally, MODEL_PATH will be a /models/... path that won’t exist inside the container, causing the inference command to fail. Either mount /models into the container as well, or add the same /models→/data fallback logic used in the GitHub Actions workflow.
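A minimal sketch of the "mount /models as well" option: collect a `-v` flag for each model root that actually exists on the host, so whichever path `MODEL_LOCAL_PATH` resolves to is also visible inside the container. The helper name `build_mount_flags` is hypothetical, not from the repo.

```shell
# Hypothetical helper: emit -v host:container bind-mount flags only for
# directories that exist on this runner, so the container sees the same
# model roots (/data, /models) the script probes on the host.
build_mount_flags() {
  flags=""
  for root in "$@"; do
    if [ -d "$root" ]; then
      flags="$flags -v $root:$root"
    fi
  done
  printf '%s\n' "$flags"
}

# Example: mount whichever of /data and /models is present.
MOUNT_FLAGS=$(build_mount_flags /data /models)
echo "MOUNT_FLAGS='$MOUNT_FLAGS'"
# The flags would then be spliced into the docker run invocation; note that
# MOUNT_FLAGS may be empty if neither directory exists.
```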
grep -c outputs '0' to stdout and then exits with code 1 when no matches are found. The '|| echo 0' fallback then outputs a second '0', making LOAD_COUNT='0\n0', which breaks the integer comparison. Use '|| true' to suppress the error exit and default to 0 if empty.
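A quick shell reproduction of the pitfall described above; the log file is a throwaway temp file, and "Model loaded" stands in for whatever marker the real script greps for.

```shell
# grep -c prints "0" AND exits 1 when there are zero matches, so an
# `|| echo 0` fallback captures a second line alongside the first "0".
log=$(mktemp)
printf 'starting up\nready\n' > "$log"

BAD=$(grep -c "Model loaded" "$log" || echo 0)   # captures "0" twice
GOOD=$(grep -c "Model loaded" "$log" || true)    # captures a single "0"

echo "BAD has $(printf '%s\n' "$BAD" | wc -l) line(s); GOOD='$GOOD'"
```

An integer comparison like `[ "$BAD" -gt 0 ]` fails outright on the two-line value, which is exactly the breakage the comment flags.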
Move all analysis logic into gpu_load_test.sh and remove redundant analysis step from the workflow. The script now handles GPU status check, model loading, timing analysis, and detailed diagnostics (including problem analysis and recommended actions for high variance).
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
```yaml
- name: Kill all Docker containers and clean up workspace
  run: |
    echo "=== Cleaning up containers on $(hostname) ==="
    containers=$(docker ps -q)
```
docker kill $(docker ps -q) will terminate all containers running on the self-hosted runner, including unrelated long-lived services/containers that might be required on that machine. Please scope cleanup to containers created by this workflow (e.g., stop/rm only atom_inference, or filter by a workflow-specific label) rather than killing everything.
Suggested change:

```diff
-- name: Kill all Docker containers and clean up workspace
-  run: |
-    echo "=== Cleaning up containers on $(hostname) ==="
-    containers=$(docker ps -q)
+- name: Kill workflow Docker containers and clean up workspace
+  run: |
+    echo "=== Cleaning up workflow containers on $(hostname) ==="
+    containers=$(docker ps -q --filter "name=atom_inference")
```
```shell
MOUNT_FLAGS=""
[ -d "/data" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /data:/data"
[ -d "/models" ] && MOUNT_FLAGS="$MOUNT_FLAGS -v /models:/models"
docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
```
The workspace cleanup command rm -rf /workspace/* will not remove dotfiles (e.g., .git, .github, .cache), so state can leak between runs on self-hosted runners. Consider using a cleanup that also removes dotfiles (or rely on actions/checkout with clean: true/git clean -ffdx) so each run starts from a truly clean workspace.
Suggested change:

```diff
-docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/*" || true
+docker run --rm -v "${{ github.workspace }}":/workspace -w /workspace --privileged rocm/pytorch:latest bash -lc "ls -la /workspace/ && rm -rf /workspace/* /workspace/.[!.]* /workspace/..?*" || true
```
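A small demonstration of why the plain glob leaks dotfiles and how the three-glob pattern catches them. The paths are throwaway temp directories, not the real workspace.

```shell
# `dir/*` never matches names starting with ".", so dotfiles survive rm.
# Adding `.[!.]*` (dot + non-dot + anything) and `..?*` (dot-dot + more)
# covers hidden entries without ever matching "." or ".." themselves.
d=$(mktemp -d)
touch "$d/file" "$d/.git_marker"

rm -rf "$d"/*                           # plain glob: dotfile survives
LEAKED=$(ls -A "$d")                    # still lists .git_marker

rm -rf "$d"/* "$d"/.[!.]* "$d"/..?*     # extended globs remove it
REMAINING=$(ls -A "$d")                 # empty now

echo "leaked: '$LEAKED' / remaining: '$REMAINING'"
```

Unmatched globs are passed to rm literally, but `rm -f` stays silent and exits 0 for nonexistent operands, so the extra patterns are harmless when no dotfiles exist.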
```diff
  INPUT="${{ inputs.runners || 'all' }}"
  if [ "$INPUT" = "all" ]; then
-   MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"MI355-8GPU"}]'
+   MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
```
This workflow now targets new runner labels (mia1-p01-g33, g34, etc.). .github/runner-config.yml explicitly says to update the runner→GPU mapping when runner labels in workflows change, but it currently only lists atom-mi355-8gpu.predownload and linux-atom-mi355-*. Please add entries for these new labels (or adjust the workflow to use existing mapped labels) so downstream tooling stays accurate.
Suggested change:

```diff
-MATRIX='[{"runner":"mia1-p01-g33","label":"mia1-p01-g33"},{"runner":"mia1-p01-g34","label":"mia1-p01-g34"},{"runner":"mia1-p01-g40","label":"mia1-p01-g40"},{"runner":"mia1-p01-g42","label":"mia1-p01-g42"},{"runner":"mia1-p01-g45","label":"mia1-p01-g45"},{"runner":"mia1-p01-g64","label":"mia1-p01-g64"}]'
+MATRIX='[{"runner":"atom-mi355-8gpu.predownload","label":"atom-mi355-8gpu.predownload"},{"runner":"linux-atom-mi355-8gpu","label":"linux-atom-mi355-8gpu"}]'
```
```shell
echo "═══════════════════════════════════════════════════════"
echo ""
echo "1. 🔄 RUN TEST AGAIN to see if same GPUs are slow:"
echo "   ./test_gpu_load.sh"
```
The diagnostic output suggests re-running ./test_gpu_load.sh, but this repository script is .github/scripts/gpu_load_test.sh (and the workflow executes that). Please update the printed command/path to the actual script name/location to avoid confusion during manual troubleshooting.
Suggested change:

```diff
-echo "   ./test_gpu_load.sh"
+echo "   ./.github/scripts/gpu_load_test.sh"
```
```diff
  # Identify slow GPUs
- SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
+ SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
```
awk is passed -v max="$MAX_TIME" here, but max is never referenced in the awk program. Please drop the unused variable to avoid confusion (or use it if it was intended for the threshold logic).
Suggested change:

```diff
-SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v max="$MAX_TIME" -v min="$MIN_TIME" '{
+SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '{
```
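For reference, a self-contained sketch of a min-relative threshold like the one the awk program appears to implement. The 1.5x slack factor and the "gpu-index load-time" input format are assumptions for illustration, not taken from the repo.

```shell
# Hypothetical data: GPU index and model-load time in seconds.
LOAD_DATA='0 12.1
1 12.3
2 19.8'
MIN_TIME=12.1

# Flag GPUs whose load time exceeds 1.5x the fastest GPU's time; only
# `min` is passed in, since that is all the threshold needs.
SLOW_GPUS=$(echo "$LOAD_DATA" | awk -v min="$MIN_TIME" '$2 > min * 1.5 { print $1 }')
echo "slow GPUs: $SLOW_GPUS"
```

Passing an unused `-v max=...` would not change this output, which is why the review suggests dropping it.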
Summary

- Replaced the shared atom-mi355-8gpu.predownload runner label with per-machine runner labels.

Follows up on #298